In [1]:
import lxml.etree
import os

0. Preparing Data

Before digging into the parser notebook, the version of the CWE xml file within this notebook is v3.0, which can be downloaded from this link. Here we loaded CWE v3.0 xml file. Therefore, if there is any new version of XML raw file, please make change for the following code. If the order of weakness table is changed, please change the code for function extract_target_field_elements in section 2.1.


In [2]:
cwe_xml_file='cwec_v3.0.xml'

1. Introduction

The purpose of this notebook is to build the fields parser and extract the contents from various fields in the CWE 3.0 XML file so that the field content can be directly analyzed and stored into database. Guided by CWE Introduction notebook, this notebook will focus on the detail structure under Weakness table and how parser functions work within the weakness table.

To preserve the semantic information and not lose details during the transformation from the representation on website and XML file to the output file, we build a 3-step pipeline to modularize the parser functions for various fields in different semantic format. The 3-step pipeline contains the following steps: searching XML Field node location, XML field node parser, and exporting the data structure to the output file based on the semantic format in Section 4 of CWE Introduction Notebook. More details will be explained in Section 2.

2. Parser Architecture

The overall parser architecture is constituted by the following three procedures: 1) extracting the nodes with the target field tag, 2) parsing the target field node to the representation in memory, and 3) exporting the data structure to the output file.

Section 2.1 explains the way to search XML field nodes with the target field tag. No matter parsing which field, the first step is to use Xpath and then locate all XML field nodes with the field tag we are intended to parse. The function in section 2.1 has been tested for all fields and thus can locate XML nodes with any given field naming. However, the function is inconsistent for different versions, since the order of weakness table might be different. It happens between v2.9 and v3.0.

Section 2.2 explains the way to parse and extract the content of the target field into the representation in memory. Since different fields have various nested structures in xml raw file and the content we will parse varies field by field, the worst situation is that there will be one parser function for each different field. However, from Section 4 in CWE Introduction Notebook, certain fields may share a same format on website, such as table or bullet list, the ideal situation is that we would have only 4 or 5 functions to represent the data in memory.

Section 3 addresses the way to export the data representation from Section 2.2. A set of functions in Section 3 should be equal to the number of data structures in Section 2.2.

2.1 XML Field Node Location

This function searches the tree for the specified field node provided (e.g. Potential_Mitigations) as input and returns the associated XML node of the field. The string containing the field name can be found in the Introductory Notebook's histogram on Section 4 . As it can be observed in that histogram, only certain fields are worthwhile parsing due to their occurrence frequency.


In [3]:
def extract_target_field_elements(target_field, cwe_xml_file):
    '''
    This function responsibility is to abstract how nodes are found given their field name and should be used together with the histogram.
    
    Args:
        - target_field: the arg defines which nodes are found given by the field that we are aiming to target
        - cwe_xml_file: the CWE xml file that this function will work and extract the target field nodes
    Outcome:
        - a list of nodes that have the pre-defined target field as the element tag
    '''
    # read xml file and store as the root element in memory
    tree = lxml.etree.parse(cwe_xml_file)
    root = tree.getroot()

    # Remove namespaces from XML.  
    for elem in root.getiterator(): 
        if not hasattr(elem.tag, 'find'): continue  # (1)
        i = elem.tag.find('}') # Counts the number of characters up to the '}' at the end of the XML namespace within the XML tag
        if i >= 0: 
            elem.tag = elem.tag[i+1:] # Starts the tag a character after the '}'
            
    # define the path of target field. Here we select all element nodes that the tag is the target field
    target_field_path='Weakness/./'+target_field
    # extract weakness table in the XML // if the order of weakness table is changed, please make change for the following code
    weakness_table = root[0]
    
    # generate all elements with the target field name
    target_field_nodes=weakness_table.findall(target_field_path)
    return target_field_nodes

2.2 XML Node Field Parser

Once the node is provided by the former function, its XML structure is consistent for that field, and cwe version, but it is inconsistent for different versions. For example, a table represented in XML is different than a paragraph in XML. However, even for the same expected structure such as a table, the XML may have different tags used within it.

The associated parser function then is tested to cover all possible cases of this field in the XML, and also interpreted against its .html format to understand its purpose. We again refer to the introductory notebook Sections 4 and 5 which convey respectively the potential usage of the field for PERCEIVE and the overall format of the structure when compared to others.

The purpose is then documented as part of the functions documentation (sub-sections of this Section), conveying the rationale behind what was deemed potentially useful in the content to be kept, as well as how certain tags were mapped to a data structure while being removed.

The parser function outputs one of the known data structures (i.e. memory representation) that is shared among the different fields of what is deemed relevant. For example, while 5 field nodes may have their relevant information in different ways, they may all be at the end of the day tables, named lists, or single blocks of texts. Sharing a common representation in this stage decouples the 3rd step of the pipeline from understanding the many ways the same information is stored in the XML.

Because different CWE versions may organize a .html table, bullet list etc in different ways even for the same field, this organization also decouples the previous and following section functions from being modified on every new version if necessary.

The following fields have parser functions as of today:

Field Name Function Name
Potential_Mitigations parse_potential_mitigations
Common_Consequences parse_common_consequences

2.2.1 Parse Potential_Mitigations

Potential_Mitigations field has a nested structure under the field element. To understand the nesting structure, here we use the following image for cwe-1022 as example. Under Potential_Mitigatations element, there are two mitigation entries named by 'Mitigation', which represent the way to mitigate the weakness in the development cycle. Under each mitigation node, there are multiple sub-enties that constitute one mitigation (phase and description in cwe-1022 example), which have the contents that our parser is intended to extract.

To preserve the named list format of Potential_Mitigations, we use the dictionary to pair the CWE id and the content we parse from Potential_Mitigations field. Since there are multiple mitigation methods to mitigate the weakness, a list of dictionaries will be used to store the content, where the number of dictionaries is equal to the number of mitigation methods. And then the tag and the corresponding content will be paired in each dictionary. In summary, the data structure in memory can be represented as the following format: {CWE_id: [{tag1:content1, tag2: content2..}, {tag1:content3, tag2:content4..}...]}. More details can be found in the example of cwe-1022.

There are two special cases when parsing Potential_Mitigations field:

1) Various sub-entries:

Some Mitigation nodes may contain more sub-entries, other than phase and description, such as strategy, effectiveness and effectiveness_notes. These entries can be found in cwe-1004 and cwe-106. In this case, the parser will store the tag and content as same as phase and description.

2) HTML tags under Description node:

In some cases, the content under Description node will be stored in multiple html elements, such as p, li, div, and ul. These html tags are used to separate the sentences from a paragraph. For example, there are two html elements <p> under the description of the second mitigation node in the following images. By comparing to how the contents are represented on the webiste, we conclude the tag <p> is not useful to be kept. Therefore, in this case, the parser will concatenate the content of description under the same mitigation node and remove the tag <p>.

Since the number of element varies depending on the CWE_id, here is the cardinality of these tags:

Tag Cardinality
Phase 1
Description 1
Strategy 0 or 1
Effectiveness 0 or 1
Effectiveness_Notes 0 or 1
  • How content represent on the website (cwe-1022)

  • How content represent in the xml file (cwe-1022)


In [4]:
def parse_potential_mitigations(potential_mitigation_node):
    '''
    The parser function concern is abstracting how the Potential_Mitigations field is stored in XML, 
    and provide it in a common and simpler data structure
    
    Args:
        - potential_mitigations_node: the node that has Potential_Mitigations tag, such as the above image
    
    Outcomes: 
        - A dictionary that pairs cwe_id as key and the mitigation list as value.  
          In the dictionary, the mitigation list will be a list of dictionaries that each dictionary pairs tag and the corresponding content for each mitigation.
          More details can be found in the following example for cwe-1022
    '''
    # extract cwe_id from the attribute of potential_mitigations element's parent node
    cwe_id=potential_mitigations_node.getparent().attrib.get('ID')
    cwe_id='CWE_'+cwe_id
    # the mitigation list that each element represents an indivudual mitigation element
    mitigation_list=[]
    target_field=potential_mitigations_node.tag
    
    # for each mitigation node under the potential_mitigations node
    for mitigation in list(potential_mitigations_node):

        # the dictionary that contain the information for each mitigation element
        mitigation_dict=dict()

        # traverse all mitigation_element nodes under each mitigation node
        for mitigation_element in list(mitigation):
            # generate tag and content of each mitigation_element
            mitigation_element_tag=mitigation_element.tag.lower()
            mitigation_element_content=mitigation_element.text
            
            ## in case there is nested elements under mitigation_element but store the content from a same tag
            # check whether there is an element under mitigation_element
            if mitigation_element_content.isspace():
                entry_element_content=''
                
                # iterate all child elements below mitigation_element, 
                for mitigation_element_child in mitigation_element.iter():
                    # extract the content
                    mitigation_element_child_content=mitigation_element_child.text
                    # if there is no content under the element or if this a nested element that contain one more element, then move to the next
                    if mitigation_element_child_content.isspace():
                        continue
                    # if not, merge the content
                    else:
                        mitigation_element_content+=mitigation_element_child_content
            # store the tag and content for each mitigation element to the dictionary
            mitigation_dict[mitigation_element_tag]=mitigation_element_content.strip()
        # add each mitigation element dictionary to mitigation_list
        mitigation_list.append(mitigation_dict)
    
    # pair the cwe_id with the mitigation contents
    potential_mitigations_dict=dict()
    potential_mitigations_dict[cwe_id]=mitigation_list
    return potential_mitigations_dict

Through function parse_potential_mitigations, the above Potential_Mitigations node for cwe_1022 will be parsed into the following data format in memory.

  • How content represent in memory (cwe-1022)

2.2.2 Parse Common_Consequences

Common_Consequences field has a nested structure under the field element. To understand the nesting structure, here we use the Common_Consequences field in cwe-103 as example. Under Common_Consequences element, there are two field entries named by 'Consequence', which represent two different consequences associated with the weakness. Under each consequence element, three entry elements constitute one weakness consequence, including scope, impact, and note, which have the contents that our parser is intended to to extract.

To preserve the table format of Common_Consequences, we use the dictionary to pair the CWE id and the content we parse from Common_Consequences field. Since there are multiple consequences for one weakness, a list of dictionaries will be used to store the content, where the number of dictionaries is equal to the number of consequences. Since one consequence may have multiple impacts and scopes but only one note, we use tuple to store the content of impact and scope, while directly store the content of note. In summary, the data structure in memory can be represented as the following format: {CWE_id: [{'Scope':scope tuple, 'Impact':impact tuple, 'Note': Text}, {'Scope':scope tuple, 'Impact':impact tuple, 'Note': Text}...]}. More details can be found in the example of cwe-103.

Since the number of element varies depending on the field, here is the cardinality of these fields:

Tag Cardinality
Scope 1 or more
Impact 1 or more
Note 0 or 1
  • How content represent on the website (cwe-103)

  • How content represent in the xml file (cwe-103)


In [5]:
def parse_common_consequences(common_consequences_node):
    '''
    The parser function concern is abstracting how the Common_Consequences field is stored in XML, 
    and provide it in a common and simpler data structure
    
    Args:
        - common_consequences_node: the node that has Common_Consequences tag, such as the above image
    
    Outcomes: 
        - A dictionary that pairs cwe_id as key and the consequence list as value.  
          In the dictionary, the consequence list will be a list of dictionaries that each dictionary pairs tag and the corresponding content for each consequence.
          More details can be found in the following example for cwe-103.
    '''
    # extract cwe_id from the attribute of common_consequences element's parent node
    cwe_id=common_consequences_node.getparent().attrib.get('ID')
    cwe_id='CWE_'+cwe_id
    # the consequence list that each element represents an indivudual consequence element
    consequence_list=[]
    target_field=common_consequences_node.tag
    
    # for each consequence node under the common_consequence node
    for consequence in list(common_consequences_node):

        # the dictionary that contain the information for each consequence element
        consequence_dict=dict()

        # traverse all consequence_element nodes under each consequence node
        for consequence_element in list(consequence):
            # generate tag and content of each consequence_element
            consequence_element_tag=consequence_element.tag.lower()
            consequence_element_content=consequence_element.text.strip()
            
            # parse the note content directly as the value
            if consequence_element_tag=='note':
                consequence_dict[consequence_element_tag]=consequence_element_content
            # for scope and impact, parse the content for scope and impact as tuple
            else:
                # if the tag is already in the dictionary, add the content to the existing tuple
                if consequence_element_tag in consequence_dict:
                    consequence_dict[consequence_element_tag]+=(consequence_element_content,)
                # if not, create a tuple to contain the content
                else:
                    consequence_dict[consequence_element_tag]=(consequence_element_content,)
                           
        # add each consequence element dictionary to conisequence_list
        consequence_list.append(consequence_dict)
    
    # pair the cwe_id with the consequence contents
    common_consequences_dict=dict()
    common_consequences_dict[cwe_id]=consequence_list
    return common_consequences_dict

Through function parse_common_consequences, the above Common_Consequences node for cwe_103 will be parsed into the following data format in memory.

  • How content represent in memory (cwe-103)

3. Export Data Structure

At the point this notebook is being created, it is still an open end question on how will tables, bullet lists and other potential structures in CWE will be used for topic modeling. For example, should rows in a table be concatenated into a paragraph and used as a document? What if the same word is repeated as to indicate a lifecycle?

In order to keep this notebook future-proof, this section abstracts how we will handle each field representation (e.g. table, bullet list, etc) from the memory data structure. It also keeps it flexible for multi-purpose: A table may be parsed for content for topic modeling, but also for extracting graph relationships (e.g. the Related Attack Pattern and Related Weaknesses fields contain hyperlinks to other CWE entries which could be reshaped as a graph).


In [7]:
def export_data(target_field_node):
    '''This section code will be done in the future.'''
    pass

4. Main Execution

After developing the 3 steps parsing pipeline, this section will combine these 3 steps and produce the output file for different fields. As introduced in Section 2, although the parsing procedure keeps same for all fields, each field will have own parsing function, while the same format of fields may share a same exporting function. As a result, the main execution varies for each field.

4.1 Main execution for Potential_Mitigations

The main execution will combine the above 3 steps parsing pipeline for Potential_Mitigations. After developing function export_data, the following code should produce the output file that contains the parsed content of Potential_Mitigations for all CWE_id.


In [9]:
if __name__ == "__main__":
    # extract the nodes, whose tag is Potential_Mitigations,from cwe_xml_file
    potential_mitigations_nodes=extract_target_field_elements('Potential_Mitigations',cwe_xml_file)
    # read each Potential_Mitigation node
    for potential_mitigations_node in potential_mitigations_nodes:
        # parse the content for each potential_mitigation node
        potential_mitigations_info=parse_potential_mitigations(potential_mitigations_node)
        # export the parsed content TO-DO
        export_data(potential_mitigations_info)

4.2 Main execution for Common_Consequences

The main execution will combine the above 3 steps parsing pipeline for Common_Consequences. After developing function export_data, the following code should produce the output file that contains the parsed content of Potential_Mitigations for all CWE_id.


In [11]:
if __name__ == "__main__":
    # extract the nodes, whose tag is Common_Consequences, from cwe_xml_file
    common_consequences_nodes=extract_target_field_elements('Common_Consequences',cwe_xml_file)
    # read each Common_Consequences node
    for common_consequences_node in common_consequences_nodes:
        # parse the content for each common_consequence node
        common_consequence_info=parse_common_consequences(common_consequences_node)
        # export the parsed content  TO-DO
        export_data(common_consequence_info)